Replication package for 
"In the Land of AKM: Explaining the Dynamics of Wage Inequality in France"

# Overview

This replication package contains the code to prepare the data, run the estimations,
and construct the tables and figures in the paper and the online appendix.

The scripts are identical to the ones used for the submitted version of the paper,
except for directory paths and other environment variables, comments and documentation,
and the "AKMland_simulation.R" script that builds and uses simulated data.

# Data

"In the Land of AKM" is built on confidential data: French exhaustive linked data on
employees, employers, jobs and wages, called BTS ("Base tous
salariés"), produced by INSEE, the French national statistical institute.
The data is protected by laws on statistical secrecy and can be accessed
only within INSEE, or, for external researchers, through the CASD
(Centre d'Accès Sécurisé aux Données).
These two access routes lead to slightly different versions of the data. Our estimation
results are nearly identical on both versions,
but the published results were estimated on the INSEE files, because of the superior
computational resources available at INSEE.

For reproducibility, we also provide a script that generates simulated
data, so that the code can be explored and tested even without access to the original, 
confidential data.

## Data access

The CASD website explains how to use the data portal:
<https://www.casd.eu/en/>. The documentation for the specific data source is
available at <https://www.casd.eu/en/source/all-employees-databases-employee-data/>.
The use of CASD computational resources is a paid service.

### Request for access

Authorization to access the BTS data is granted by the Confidentiality
Committee: <https://www.comite-du-secret.fr/en/home/>. The detailed
request procedure is explained on the CDAP (confidential data access portal) website:
<https://cdap.casd.eu/comite-secret-statistique>.

Access is for researchers only. Access boxes are available only in France, European Union
countries and, with additional conditions, in a few other countries including the USA and Canada.

Researchers must undergo a short training session on the portal and sign a contract.
The full procedure might take several weeks or months,
depending on the Confidentiality Committee session calendar.

## Data preparation

### Building the panels

We chain the annual BTS files from 2002 to 2019 to build our panels. The original
files are in the SAS data format, and the scripts use the SAS language.
The SAS scripts are in the "pseudo_id" directory, with a dedicated readme file.

These scripts to build the panel were first shared at
<http://olivier.godechot.free.fr/hopfichiers/pseudo_id.zip>
in 2022 and have already been used in other research. An updated version, including
scripts adapted for the 2020 to 2023 yearly data files, is in the directory
"pseudo_id_updated version".

The output of these scripts is a new variable:
an identification number for workers that allows for matching across yearly files.

### Preparing the estimation files
Following the pseudo_id computation, the data preparation step is light:
- the "dataprep" function selects variables and observations,
and computes a few additional variables (yearly mean wage, etc.)
- the "fst_translate" function writes the yearly files in the .fst format for
better reading and writing speed.

Other data preparation, including narrower selection of observations, can be called
through parameters in estimation functions.

### Building the narrow panel
Insee already provides chained BTS files on a sample of 1/12th of the worker population.
This panel is available over a longer period (since 1976). The SAS script that builds the estimation files for the historical series from
the Insee BTS narrow panel files is
aggregate_DADS_panel_longrun.sas.

## Simulated data
Even without access to the original confidential data, you
can run all the estimations on simulated data.
Start with AKMland_Simulation.R. The script creates the simulated data and
demonstrates a few basic uses. You can then run AKMland_estimations.R, our main
estimation script, directly on this simulated data.

For cluster estimations, you must run AKMland_clusters.R to build the firm clusters, before running 
the estimation script.
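
The run order described above can be sketched with a small helper. `run_scripts` is a hypothetical function written for this illustration, not part of the package. The script names are spelled as they appear in this section, and the sketch assumes each script can be sourced as is from the project directory.

```r
# Hypothetical helper (not part of the replication package): source the
# scripts in the given order, stopping if one of them is missing.
run_scripts <- function(scripts, dir = ".") {
  for (s in scripts) {
    path <- file.path(dir, s)
    if (!file.exists(path)) stop("Script not found: ", path)
    message("Running ", s)
    source(path)
  }
  invisible(scripts)
}

# Order for a cluster estimation on simulated data.
akmland_order <- c(
  "AKMland_Simulation.R",   # creates the simulated data
  "AKMland_clusters.R",     # builds the firm clusters
  "AKMland_estimations.R"   # main estimation script
)
# run_scripts(akmland_order)   # uncomment inside the project directory
```

For estimations that do not use firm clusters, the second entry can be dropped.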

The simulation function was written for appendix D.4 as a robustness check
for the split sampling procedure. The simulation does not aim to emulate all
aspects of the original data. Most notably, occupations and industries are random
draws, without any of the correlations observed in the real data. The narrow panel used for the historical series is not simulated.
Depending on the simulation parameters, some functions might not work as intended (for instance, there might be too few firms with 20+ workers in the simulated data).


# Replication code

## Settings

### Computational requirements
We computed the estimations on the INSEE data platform, which provides adaptive, high-performance computing.
Data preparation was done mostly with SAS.
Estimations are in R (part of the work has been replicated in Stata, not shared here).
RAM use could reach 130 GB. We used 300 GB of disk space,
allowing for manipulation of the original yearly data, the estimation
panels, estimation results, some intermediary data (firm level data, etc.),
and historical series on the "narrow panels".

The most costly step is the estimation of the AKM two fixed-effects model. 
We use the "fixest" R package, faster than other solutions by an order of magnitude.
For faster loading of yearly files, we store them in the .fst format ("fst" R package).
To limit memory usage, we store panels and results as .parquet files.


### Using renv to install packages
Download the code, open the project and run renv::restore() in the R console.
All packages and dependencies will then be installed in the project library.
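
As a minimal sketch of that step, the wrapper below restores the library only when a lockfile is present. `restore_akmland` is a hypothetical name, and the sketch assumes the lockfile is the usual `renv.lock` at the project root.

```r
# Hypothetical wrapper around renv::restore(); assumes renv.lock sits at the
# project root, which is the renv default.
restore_akmland <- function(project = ".") {
  lockfile <- file.path(project, "renv.lock")
  if (!file.exists(lockfile)) {
    message("No renv.lock found in ", normalizePath(project))
    return(invisible(FALSE))
  }
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv")
  }
  # prompt = FALSE installs the recorded package versions without asking.
  renv::restore(project = project, prompt = FALSE)
  invisible(TRUE)
}

# restore_akmland()   # run from the project directory
```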

### Directories
Data files generated with SAS scripts in the SAS format, from the
Insee servers versions, are stored in the "sasrep" directory.

Files are then translated to the R fst format and saved in "fstrep".

Estimation results are stored in the "saverep" directory. They are then
used to generate the tables and graphs in the paper, either from
the AKMland_figures.R script or manually, and to run additional
analyses (regressions of FE, etc.).

The "narrow" (historical) panel fst file is in "fstpanel".

Firm clusters for the implementation of the Bonhomme, Lamadon and Manresa method are in "clusterrep".

The full panel data with estimated fixed effects are stored in the "estimrep" directory.

Files are big. The scripts attempt to minimize RAM usage, at the price
of more loading time, by loading only the data necessary for the estimation.

## Functions
Most of the code is organized in functions defined in the AKMland_functions.R script.

### Main functions
#### akm_est
This is where the two fixed effects model estimation happens,
using the "feols" function from the "fixest" package. The fixed effects variables
are hard coded as "indiv_id" and "firm_id". The function provides an option
to use log wages averaged per year with a quadratic polynomial of age as covariate, or
to add year fixed effects and a cubic polynomial of age as covariate,
with a linear constraint (polynomial flat at 40). This last specification
is the one used in the paper and is taken from Card, Heining, and Kline
(2013), hence the name of the method parameter in the code: "card".
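
The "flat at 40" constraint can be illustrated on a toy panel. The sketch below is not the akm_est code. It assumes the constraint is imposed by centring age at 40 and omitting the linear term, and every name except "indiv_id", "firm_id" and "feols" is illustrative. Within a worker, age and year move one for one, so a linear age term could not be separated from the worker and year effects anyway.

```r
set.seed(1)

# Toy panel: 200 workers observed 5 years, each year in a random firm.
n_workers <- 200; n_firms <- 20; years <- 2015:2019
panel <- expand.grid(indiv_id = seq_len(n_workers), year = years)
panel$firm_id <- sample(n_firms, nrow(panel), replace = TRUE)
birth_year <- sample(1960:1995, n_workers, replace = TRUE)
panel$age <- panel$year - birth_year[panel$indiv_id]

# Age terms centred at 40 with the linear term omitted: the profile
# b2 * a^2 + b3 * a^3 has a zero derivative at a = 0, i.e. at age 40.
a <- (panel$age - 40) / 40
panel$age2 <- a^2
panel$age3 <- a^3

alpha <- rnorm(n_workers, sd = 0.3)   # worker effects
psi   <- rnorm(n_firms,  sd = 0.2)    # firm effects
panel$log_wage <- alpha[panel$indiv_id] + psi[panel$firm_id] -
  0.5 * panel$age2 + 0.2 * panel$age3 + rnorm(nrow(panel), sd = 0.1)

# Two-way fixed effects plus year effects, if fixest is installed.
if (requireNamespace("fixest", quietly = TRUE)) {
  fit <- fixest::feols(log_wage ~ age2 + age3 | indiv_id + firm_id + year,
                       data = panel)
  print(coef(fit))             # age polynomial coefficients
  fe <- fixest::fixef(fit)     # list of worker, firm and year effects
  str(fe$firm_id)
}
```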

#### fast_split
This function splits a panel in two parts as a preparation for a split-sample estimation.
We tried many different algorithms for splitting. They are all preserved as options
accessible with the "split_type" parameter. The methods used for the paper are the
"consec_periods" and "by_firm_movers" split_types. The function has been optimized for speed,
hence the "fast" name.
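
The algorithms behind "consec_periods" and "by_firm_movers" are not detailed here. The sketch below shows only the generic idea of a split sample, under the assumption that workers are randomly partitioned into two halves. `split_workers` and the toy data are hypothetical.

```r
# Hypothetical sketch of a random split by worker; not the fast_split algorithm.
split_workers <- function(panel, seed = 1) {
  set.seed(seed)
  ids <- unique(panel$indiv_id)
  shuffled <- ids[sample.int(length(ids))]
  # Alternate halves over the shuffled workers so that both splits have the
  # same number of workers (up to one).
  half <- setNames(rep_len(1:2, length(ids)), shuffled)
  panel$split <- unname(half[as.character(panel$indiv_id)])
  panel
}

toy <- data.frame(
  indiv_id = rep(1:6, each = 2),
  firm_id  = c(1, 2, 1, 1, 2, 3, 3, 3, 1, 3, 2, 2),
  year     = rep(2018:2019, times = 6)
)
toy <- split_workers(toy)
table(toy$split)                          # observations per half
tapply(toy$split, toy$indiv_id, unique)   # each worker sits in one half only
```

Each half can then be estimated separately, which gives two estimates of every firm effect based on disjoint sets of workers.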

#### split_fast_decompo
This function computes the log-wage variance decomposition as a sum of terms that are
variances of the estimated fixed effects, or, in the case of split-sample estimation, as a sum
of variances and covariances of estimates between splits. This is the most complex part
of the code. It includes options for the yearly mean wage specification
or the year fixed effects specification, for daily or hourly wages, for estimation with or
without split-sample correction, and for computing an approximation of the variance of the residuals.
It computes two decompositions, with and without the within- and between-firm
decomposition introduced by Song et al. (2019). It also computes descriptive statistics
relevant to each computed term of the decomposition. These computations can take time,
and the function has been optimized for speed.
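
Why covariances between splits replace variances can be seen on simulated numbers. This is a stylized sketch, not the split_fast_decompo code: it assumes two estimates of the same firm effects with independent noise, and it ignores worker weights. All names are illustrative.

```r
set.seed(42)
n_firms <- 5000
psi <- rnorm(n_firms, sd = 0.2)              # true firm effects, variance 0.04

# Two estimates of the same effects with independent estimation noise,
# as one would get from two disjoint halves of the workers.
psi_hat_1 <- psi + rnorm(n_firms, sd = 0.2)
psi_hat_2 <- psi + rnorm(n_firms, sd = 0.2)

plugin_var <- var(psi_hat_1)                 # true variance plus noise variance
split_cov  <- cov(psi_hat_1, psi_hat_2)      # independent noise drops out in expectation

round(c(true = var(psi), plugin = plugin_var, split = split_cov), 3)
```

The plug-in variance is inflated by the estimation noise, while the covariance across splits stays close to the variance of the true effects.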

#### rotestim
This function is the main wrapper function. It is called directly in the 
AKMland_estimations.R script. As a wrapper, the function fulfills three main goals:
- It takes as parameters the various specification choices (split correction, 
year fixed effects, minimum firm size, etc.)
- It runs several estimates in a row, for various periods or various specifications, 
while limiting memory use.
- It saves the results.
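
The three goals above can be sketched as a loop over periods. `run_periods`, its arguments and the periods are hypothetical and do not match the rotestim signature. The sketch saves results as .rds files to stay in base R.

```r
# Hypothetical wrapper in the spirit of rotestim; names and arguments are
# illustrative only.
run_periods <- function(periods, load_panel, estimate, outdir) {
  dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
  files <- character(0)
  for (p in periods) {
    panel <- load_panel(p$start, p$end)      # load only this period
    res   <- estimate(panel)
    out   <- file.path(outdir, sprintf("est_%d_%d.rds", p$start, p$end))
    saveRDS(res, out)                        # save before moving on
    files <- c(files, out)
    rm(panel, res); gc()                     # free memory between periods
  }
  files
}

# Toy usage with a stand-in loader and estimator.
periods <- list(list(start = 2002, end = 2006), list(start = 2007, end = 2011))
toy_loader <- function(start, end) {
  data.frame(year = start:end, w = rnorm(end - start + 1))
}
toy_estimator <- function(panel) list(n = nrow(panel), var_w = var(panel$w))
saved <- run_periods(periods, toy_loader, toy_estimator,
                     file.path(tempdir(), "saverep"))
```

Keeping a single period in memory at a time trades loading time for a lower peak RAM use.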

### Other functions
#### dataprep
A few data preparation steps

#### fst_translate
Reads the SAS yearly data files and translates them into yearly .fst files for memory efficiency and fast loading and writing
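
A one-file version of this translation could look like the sketch below. It assumes the SAS files are read with the "haven" package, which may not be what fst_translate does. `sas_to_fst` and the file names in the comments are placeholders.

```r
# Hypothetical one-file translation; the real fst_translate may differ.
# Requires the haven and fst packages.
sas_to_fst <- function(sas_file, fst_file, compress = 50) {
  stopifnot(requireNamespace("haven", quietly = TRUE),
            requireNamespace("fst", quietly = TRUE))
  dat <- haven::read_sas(sas_file)
  fst::write_fst(as.data.frame(dat), fst_file, compress = compress)
  invisible(fst_file)
}

# Placeholder file names:
# sas_to_fst("sasrep/yearly_file.sas7bdat", "fstrep/yearly_file.fst")
# Later, load only the columns needed for an estimation:
# fst::read_fst("fstrep/yearly_file.fst", columns = c("indiv_id", "firm_id"))
```

Reading a subset of columns from the .fst file is what keeps loading fast and memory use low.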

#### firmdec
Computes wage variance decomposition between and within firms
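
The identity behind this decomposition can be checked on toy data. This sketch is not the firmdec code. It uses population variances weighted by workers, and the variable names are illustrative.

```r
set.seed(7)
firm_id  <- sample(1:50, 2000, replace = TRUE)
log_wage <- rnorm(50, sd = 0.3)[firm_id] + rnorm(2000, sd = 0.4)

firm_mean <- ave(log_wage, firm_id)            # firm mean, repeated per worker
pvar <- function(x) mean((x - mean(x))^2)      # population variance

total   <- pvar(log_wage)
between <- pvar(firm_mean)                     # worker-weighted variance of firm means
within  <- mean((log_wage - firm_mean)^2)      # average squared deviation from firm mean

c(total = total, between = between, within = within)
all.equal(total, between + within)
```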

#### firmquant
Computes quantiles of firm size

#### nunits
Computes various sample size statistics

#### statdes
Computes various descriptive statistics

#### detailed_statdes
More descriptive statistics (notably for regions, industries, occupations, etc.)

#### fast_split
See above

#### akm_est
See above

#### merge_fe
Gets the estimated fixed effects from the feols function results, and merges them back into the panel file.

#### get_u
Gets the residuals and the auxiliary variables component (Xb) from the feols function results

#### split_est
Calls the function akm_est on both splits of a split sample

#### all_est
Calls either akm_est or split_est depending on the method (split sample or not)

#### getvar
Small function used in split_fast_decompo

#### crossvar
Small function used in split_fast_decompo

#### crosscov
Small function used in split_fast_decompo

#### split_fast_decompo
See above

#### decompo
Small wrapper function for split_fast_decompo

#### estim
Runs an AKM estimation, then a wage variance decomposition using the AKM fixed effects, computes descriptive statistics and saves the results. This function adapts to alternative specifications (split sample or not, etc.)

#### fast_panel
Builds a panel from yearly .fst files

#### rotestim
See above

#### multisplit
Runs rotestim multiple times with a different random split each time

#### cleanestim
Prepares a panel datafile with estimated FE for regression with fereg_period

#### fereg_period
Regression of fixed effects on various explanatory variables

#### occ_FE
Computes the occupational fixed effect as the average worker fixed effect for the occupation
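
A minimal sketch of this averaging, on a toy table with illustrative column names. How occ_FE weights observations is not specified here, so the sketch uses an unweighted mean over workers.

```r
# Toy worker-level table; column names are illustrative assumptions.
workers <- data.frame(
  indiv_id   = 1:6,
  occupation = c("clerk", "clerk", "engineer", "engineer", "engineer", "driver"),
  worker_fe  = c(-0.2, 0.0, 0.5, 0.3, 0.4, -0.1)
)

# Occupational effect: mean worker effect within the occupation.
occ_fe <- tapply(workers$worker_fe, workers$occupation, mean)

# Merge back and split the worker effect into an occupational part and a
# within-occupation remainder.
workers$occ_fe   <- as.numeric(occ_fe[as.character(workers$occupation)])
workers$resid_fe <- workers$worker_fe - workers$occ_fe
workers
```

By construction, the remainder averages to zero within each occupation.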

#### keepoccstat
Computes descriptive statistics on occupations for each split

#### occ_decomp
Computes the wage variance decomposition including the occupational fixed effect variance and covariance terms

#### firmdt
Creates a firm-level file for firm-level analysis, from the panel datafile with estimated FE.

#### yearlyselect
Small sample selection function used in period_quantile

#### period_quantile
Computes wage quantiles for each year

#### hist_dataprep
Aligns variable names in the narrow panel files on variable names in the exhaustive panel files

#### varselect
Computes the wage variable of interest (hourly, daily, gross or net) for the historical series

#### hist_build_panel
Builds the panel for estimation, for the historical series

#### hist_rotmulti
Multiple split-estimations with a different random split each time, for the historical series

#### mobsim
Simulates data from scratch. Can be used to test all other functions with fake data. The wage equation is realistic. However, the simulated mobility network is very different from real mobility networks.

#### wagesim
Simulates wages from real mobility data. The goal is to have simulated data with known latent fixed effects, with a realistic mobility network.

#### simcomp
Compares estimates on simulated data (from wagesim) with known latent FE values.
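
The idea of simulating wages with known latent effects and comparing them with estimates can be sketched in base R. This is not the mobsim, wagesim or simcomp code. The mobility is purely random, the sample is tiny so that dummy-variable OLS is feasible, and all names other than "indiv_id" and "firm_id" are illustrative.

```r
set.seed(123)
n_workers <- 60; n_firms <- 6; years <- 1:4

# Known latent effects.
alpha <- rnorm(n_workers, sd = 0.3)
psi   <- seq(-0.3, 0.3, length.out = n_firms)

# Random mobility: every worker draws a firm each year (unrealistically mobile).
sim <- expand.grid(indiv_id = seq_len(n_workers), year = years)
sim$firm_id  <- sample(n_firms, nrow(sim), replace = TRUE)
sim$log_wage <- alpha[sim$indiv_id] + psi[sim$firm_id] +
  rnorm(nrow(sim), sd = 0.05)

# Small enough for dummy-variable OLS; firm 1 is the reference category.
fit <- lm(log_wage ~ 0 + factor(indiv_id) + factor(firm_id), data = sim)
psi_hat <- c(0, coef(fit)[paste0("factor(firm_id)", 2:n_firms)])

# Firm effects are identified up to a constant, so compare after centring.
cbind(true = psi - mean(psi), estimated = psi_hat - mean(psi_hat))
cor(psi, psi_hat)
```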

#### size_decomp
Wage decomposition results separated by firm size category

#### firm_network
Builds the graph of firms linked by worker mobility

#### network_centrality
Computes centrality measures in the mobility network of firms for each firm
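
The mobility graph and a simple centrality measure can be sketched in base R. This is not the firm_network or network_centrality code, which may rely on a graph package and on other centrality measures. The toy data are illustrative, and degree is used here only as the simplest example.

```r
# Toy panel: one row per worker-year.
panel <- data.frame(
  indiv_id = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  year     = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2),
  firm_id  = c("A", "B", "C", "A", "A", "B", "C", "C", "A", "D", "A")
)
panel <- panel[order(panel$indiv_id, panel$year), ]

# A move is a change of firm between two consecutive rows of the same worker.
n <- nrow(panel)
same_worker <- panel$indiv_id[-1] == panel$indiv_id[-n]
moved       <- panel$firm_id[-1] != panel$firm_id[-n]
edges <- data.frame(from = panel$firm_id[-n],
                    to   = panel$firm_id[-1])[same_worker & moved, ]

# Degree centrality: number of distinct firms linked to each firm by a move,
# ignoring the direction of the move.
und <- unique(rbind(edges, data.frame(from = edges$to, to = edges$from)))
degree <- table(und$from)
degree
```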

#### centrality_stat
Computes summary statistics on centrality overall and for each firm size category

#### build_panel
Builds a panel from original .fst yearly files, to be used in year2_reduce

#### year2_reduce
Prepares the panel data for the RBLM package. Most importantly, the RBLM package uses only 2 years of observations for each firm, so this function selects only two years for each firm, even if the panel is longer than 2 years.

## Main estimations
The AKMland_estimations.R script computes the estimations for the paper and the appendices.

## Additional scripts
Some additional analyses are in specific scripts:
- AKMland_occupation_change.R measures the rate of occupational transitions
per year.
- AKMland_clusters.R computes and saves the clusters of firms, used in the BLM (Bonhomme Lamadon Manresa 2019)-style estimation.

# Paper
Insee working paper:
<https://www.insee.fr/en/statistiques/7675646>

Banque de France working paper:
<https://www.banque-france.fr/en/publications-and-statistics/publications/land-akm-explaining-dynamics-wage-inequality-france>

